-
Notifications
You must be signed in to change notification settings - Fork 228
Add Buffer.fill() method for cuMemsetAsync support (#1314) #1318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Implements Buffer.fill(value, width, *, stream) method that wraps cuMemsetD8Async, cuMemsetD16Async, and cuMemsetD32Async based on the width parameter (1, 2, or 4 bytes). - Add fill() method to Buffer class in _buffer.pyx - Support width=1 (byte), width=2 (16-bit), width=4 (32-bit) - Validate width, value range, and buffer size divisibility - Add comprehensive tests in test_memory.py - Tests cover all widths, error cases, and verification Part of issue NVIDIA#1314: CUDA Graph phase 3 - memcpy nodes
Extend test_graph_alloc with 'fill' action parameter to test Buffer.fill() in graph capture mode. The test verifies graph capture for Buffer operations including copy_from, copy_to, fill, and kernel launch operations. Part of issue NVIDIA#1314
- Replace Python driver module calls with direct cydriver calls - Use 'with nogil:' blocks around CUDA driver API calls - Use HANDLE_RETURN macro for error handling - Cast stream to Stream type to access _handle attribute - Improves performance by eliminating Python overhead
- Replace Python driver module calls with direct cydriver calls - Use 'with nogil:' blocks around CUDA driver API calls - Use HANDLE_RETURN macro for error handling - Cast stream to Stream type to access _handle attribute - Remove unused raise_if_driver_error import - Improves performance by eliminating Python overhead
|
/ok to test 7d9747d |
This comment has been minimized.
This comment has been minimized.
| cdef cydriver.CUstream s = s_stream._handle | ||
|
|
||
| # Validate width | ||
| if width not in (1, 2, 4): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put the validation code closer to the top of the function so we avoid any setup work in the error case where the user passes an unsupported size to the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I fiddled with it to simplify the logic. To be honest, I don't see a big improvement, here, since most of the preceding statements just declare stack variables.
| buffer1.fill(0x42, width=1, stream=stream) | ||
| device.sync() | ||
|
|
||
| if check: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are parametrizing these value checks in a test suite? The memory sizes don't strike me as so large that these operations would be slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values are only checked when the memory allocation is pinned. This follows the existing pattern.
rparolin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good generally, just a couple comments.
- Add _validate_value_against_bitwidth helper function - Move helper function to end of file as cdef function - Use 64-bit platform integers (int64_t/uint64_t) instead of Python ints - Add assertion that bitwidth < 64 - Remove magic numbers from fill() method - Update tests to match new error message format
|
/ok to test 8e2cddf |
|
cc @pciolkosz who implemented CCCL's I think something what would make more sense would be for
|
|
|
||
| # Map width (bytes) to bitwidth and validate value | ||
| cdef int bitwidth = width * 8 | ||
| _validate_value_against_bitwidth(bitwidth, value, is_signed=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit of unexpected behavior that this will throw if someone passes a negative integer
|
|
||
| # Helper Functions | ||
| # ---------------- | ||
| cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: we should probably make this an inline function for performance reasons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and avoid except*, so it should return an int instead, something like
| cdef void _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except *: | |
| cdef int _validate_value_against_bitwidth(int bitwidth, int64_t value, bint is_signed=False) except?-1: |
| if is_signed: | ||
| min_value = -(<int64_t>1 << (max_bits - 1)) | ||
| max_value = (<int64_t>1 << (max_bits - 1)) - 1 | ||
| else: | ||
| min_value = 0 | ||
| max_value_unsigned = (<uint64_t>1 << max_bits) - 1 | ||
| max_value = <int64_t>max_value_unsigned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: getting the min and max values here can be cimported from the built in libc module: https://github.com/cython/cython/blob/master/Cython/Includes/libc/limits.pxd
i.e. from libc.limits cimport INT_MAX
| value : int | ||
| Integer value to fill the buffer with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the underlying driver APIs expect an integer, but I think for Buffer.fill() it would be good to support a byte-like input as well, maybe via buffer protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To confirm my understanding, I believe the suggestion here is to accept arguments of type bytes or that provide bytes via the buffer protocol (where the number of bytes equals the buffer size) and fill our Buffer from that. Would that be better to implement as Buffer.copy_from rather than Buffer.fill? Or possibly a new function Buffer.copy_from_bytes or Buffer.copy_from_host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The internal logic will be different from Buffer.fill, since it requires copying the bytes to a staging area, unlike cuMemset*.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to allowing a user to pass a memoryview object of 1, 2, or 4 bytes in length to be used as the fill value. I.e. Buffer.fill(b'abcd').
|
From #1314 (comment):
and #1318 (comment):
My thinking was: We follow what we did for def fill(self, value, stream: Stream | GraphBuilder):
...and the user code can be: buf.fill(np.int16(8), s)This would provide a uniform UX across cuda.core APIs that accept Python/NumPy scalars. @kkraus14 @Andy-Jost WDYT? |
|
My bad, I didn't check if you had auto-merged enabled. @Andy-Jost please make sure any feedback from Keith and Leo is incorporated. |
|
|
No worries. Tracking this in #1345. |
Will follow up to fix, no worries. |
Summary
This PR implements
Buffer.fill(value, width, *, stream)method that wraps CUDA'scuMemsetAsyncfunctions, supportingcuMemsetD8Async,cuMemsetD16Async, andcuMemsetD32Asyncbased on the width parameter.Part of issue #1314: CUDA Graph phase 3 - memcpy nodes
Changes
Added
Buffer.fill()method incuda/core/experimental/_memory/_buffer.pyx:cuMemsetD8Async)cuMemsetD16Async)cuMemsetD32Async)copy_to/copy_frommethodsAdded comprehensive tests in
tests/test_memory.py:Implementation Details
The method automatically selects the appropriate CUDA driver API function based on the
widthparameter:width=1: UsescuMemsetD8Asyncwith N = buffer_size (bytes)width=2: UsescuMemsetD16Asyncwith N = buffer_size // 2 (16-bit elements)width=4: UsescuMemsetD32Asyncwith N = buffer_size // 4 (32-bit elements)Example Usage
Testing
All tests pass with comprehensive coverage of: